People are very sensitive to the quality of the speech they hear (Bailly, 2003). High-quality conversational IVR applications primarily use recordings of professional voice talent for the system voice, sometimes supplemented with artificial speech (text-to-speech, or TTS) for unbounded text, that is, text that is difficult or impossible to predict (e.g., new book or movie titles). Lower-cost conversational systems, such as those in in-vehicle or mobile devices, may rely exclusively on TTS. Research on a standardized assessment questionnaire, the MOS-X (Polkosky & Lewis, 2003), indicates four components of user satisfaction with speech output: Intelligibility, Naturalness, Prosody, and Social Impression.
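As a rough illustration of mixing recorded prompts with TTS for unbounded text, the sketch below plans a prompt as a sequence of segments: a recording where one exists, and TTS otherwise. The catalog contents, file names, and the `Segment` and `plan_prompt` names are hypothetical and chosen only for this example; a real platform would have its own prompt management and playback API.

```python
from dataclasses import dataclass

# Hypothetical catalog mapping predictable prompt text to recorded audio files.
RECORDED_PROMPTS = {
    "Welcome to the bookstore.": "welcome.wav",
    "You selected": "you_selected.wav",
}


@dataclass
class Segment:
    kind: str     # "recording" or "tts"
    payload: str  # audio file name, or the text to synthesize


def plan_prompt(parts):
    """Return one Segment per text part, preferring recordings over TTS."""
    plan = []
    for text in parts:
        audio = RECORDED_PROMPTS.get(text)
        if audio is not None:
            # Predictable text: play the voice talent's recording.
            plan.append(Segment("recording", audio))
        else:
            # Unbounded text (e.g., a new title): fall back to TTS.
            plan.append(Segment("tts", text))
    return plan
```

For example, `plan_prompt(["You selected", "A Brand New Title"])` would yield a recorded segment followed by a TTS segment. Keeping the unbounded portion as small as possible limits how much of the prompt depends on synthetic speech.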
Voice Talent
Prosody
TTS
Audio Recording Considerations
References
Bailly, G. (2003). Close shadowing natural versus synthetic speech. International Journal of Speech Technology, 6, 11–19.
Polkosky, M. D., & Lewis, J. R. (2003). Expanding the MOS: Development and psychometric evaluation of the MOS-R and MOS-X. International Journal of Speech Technology, 6, 161–182.